Management summary

1. DATA EXPLORATION

The training dataset contains 12795 observations of 16 variables (one index, one response, and 14 predictor variables).
Each record (row) describes one wine being sold, including its chemical properties. The response variable TARGET is the number of cases of that wine sold as tasting samples to restaurants and wine stores around the United States.

The variables are: [Picture: list of the variables]

1.1. Univariate analysis

Summaries for the individual variables are provided below.

##      INDEX           TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359

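From the summaries and the chart above we can see that all variables are numeric and that eight of the predictors have missing values. The share of NAs is modest for most of them (between 395 and 1210 of the 12795 observations, roughly 3% to 9.5%), with the exception of the STARS variable, which is missing in 3359 observations (about 26%).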
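The NA counts in the summary can also be tabulated directly. The following is a minimal sketch on simulated data: the data frame `wine` and its values are made up for illustration, and only the column names follow the summary above.

```r
# Toy stand-in for the training data; values are simulated, not the real data.
set.seed(1)
wine <- data.frame(
  TARGET  = rpois(200, 3),
  Alcohol = rnorm(200, 10.5, 3.7),
  STARS   = sample(1:4, 200, replace = TRUE)
)
wine$Alcohol[sample(200, 10)] <- NA   # 10 of 200 rows missing
wine$STARS[sample(200, 52)]   <- NA   # 52 of 200 rows missing

# Count and percentage of missing values per column
na_count <- colSums(is.na(wine))
na_pct   <- round(100 * na_count / nrow(wine), 1)
print(data.frame(na_count, na_pct))
```

Sorting the resulting table by `na_pct` makes the variables with the most missing data easy to spot.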

A near-zero-variance check did not flag any variable.
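The report does not state which tool performed this check. As an illustration only, the sketch below implements one common rule in base R: a column is flagged when its most frequent value heavily outnumbers the second most frequent one and the column has few distinct values. The helper `near_zero_var`, its cutoffs, and the toy data are assumptions, not the report's code.

```r
# Hypothetical helper: TRUE when a column has (near-)zero variance.
near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)           # constant (or empty) column
  freq_ratio     <- as.numeric(counts[1] / counts[2])
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique <= unique_cut
}

set.seed(2)
toy <- data.frame(
  varied   = rnorm(500),                 # many distinct values
  lopsided = c(rep(0, 495), rep(1, 5)),  # 99% of rows share one value
  constant = rep(7, 500)                 # a single value throughout
)
flags <- vapply(toy, near_zero_var, logical(1))
print(flags)
```

A result with no flagged column corresponds to the outcome reported above.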

Per-variable distribution analysis is provided below. The INDEX variable is excluded because it is immaterial to the analysis and is not considered further.

1.2. Bivariate analysis

The pairwise correlations between the continuous variables are displayed below.
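Because several variables contain NAs, a correlation matrix needs a rule for handling missing values. The sketch below uses pairwise-complete observations in base R's `cor()`. This is an assumption about a reasonable approach rather than the report's stated method, and the data are simulated.

```r
set.seed(3)
n <- 300
alcohol <- rnorm(n, 10.5, 3.7)
toy <- data.frame(
  TARGET  = rpois(n, 3),
  Alcohol = alcohol,
  Density = 1 - 0.002 * alcohol + rnorm(n, 0, 0.01)
)
toy$Alcohol[sample(n, 15)] <- NA   # inject some missing values

# Each pair of columns is correlated over the rows where both are observed
corr <- cor(toy, use = "pairwise.complete.obs")
print(round(corr, 2))
```

With `use = "pairwise.complete.obs"` each coefficient may be based on a different subset of rows, which is worth keeping in mind when comparing entries.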

2. DATA PREPROCESSING

2.1. Data cleaning

3. BUILD MODELS

3.1. Build Poisson regression models

3.1.1. Poisson model 1

Model summary

From the model summary we can see the following:

Interpretation of the regression coefficients
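As a general reminder of how Poisson coefficients are read: with the log link, exponentiating a coefficient gives the multiplicative change in the expected count for a one-unit increase in that predictor. The sketch below shows this on simulated data; the coefficient values 1 and 0.2 are assumptions chosen for the simulation and are not results from this report.

```r
set.seed(4)
n <- 2000
toy <- data.frame(LabelAppeal = sample(-2:2, n, replace = TRUE))
# Simulated so that each extra unit of LabelAppeal multiplies the mean by exp(0.2)
toy$TARGET <- rpois(n, lambda = exp(1 + 0.2 * toy$LabelAppeal))

fit <- glm(TARGET ~ LabelAppeal, family = poisson(link = "log"), data = toy)

# exp(coefficient) = rate ratio per one-unit increase in the predictor
rate_ratios <- exp(coef(fit))
print(round(rate_ratios, 3))
```

A rate ratio above 1 means the expected count rises with the predictor, and a ratio below 1 means it falls.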

The diagnostic plots for the model can be generated using the R code provided in the appendix.

3.1.2. Poisson model 2

For the second model, the following changes are made:

Model summary

The interpretation of the coefficients is the same as in the full model.

3.2. Build negative binomial regression models

3.2.1. Negative binomial model 1
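For orientation, a negative binomial regression can be fitted in R with `glm.nb()` from the MASS package. The sketch below is not the report's model: the data are simulated with an assumed dispersion (`size = 2`) and assumed coefficients, and only the column names echo the data set.

```r
library(MASS)  # provides glm.nb(); MASS ships with standard R distributions

set.seed(5)
n <- 2000
toy <- data.frame(STARS = sample(1:4, n, replace = TRUE))
# Overdispersed counts: variance exceeds the mean, unlike the Poisson case
toy$TARGET <- rnbinom(n, size = 2, mu = exp(0.5 + 0.3 * toy$STARS))

fit_nb <- glm.nb(TARGET ~ STARS, data = toy)
print(coef(fit_nb))
print(fit_nb$theta)   # estimated dispersion parameter
```

The coefficients are on the log scale, as in the Poisson case, while `theta` captures the extra variance that a Poisson model would not account for.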

3.2.2. Negative binomial model 2

3.3. Build multiple linear regression models

3.3.1. Multiple linear model 1

3.3.2. Multiple linear model 2

4. MODEL SELECTION

The performance of the continuous models is compared using RMSE on the out-of-sample data.
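An out-of-sample RMSE comparison of this kind can be sketched as follows. The helper `rmse`, the 80/20 split, the two model formulas, and the simulated data are all assumptions for illustration; the report's own models are not reproduced here.

```r
set.seed(6)
n <- 500
toy <- data.frame(
  Alcohol = rnorm(n, 10.5, 3),
  STARS   = sample(1:4, n, replace = TRUE)
)
toy$TARGET <- 1 + 0.05 * toy$Alcohol + 0.8 * toy$STARS + rnorm(n)

# Hold out 20% of the rows as out-of-sample data
test_idx <- sample(n, size = n / 5)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Hypothetical helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

full    <- lm(TARGET ~ Alcohol + STARS, data = train)
reduced <- lm(TARGET ~ Alcohol, data = train)

rmse_full    <- rmse(test$TARGET, predict(full, newdata = test))
rmse_reduced <- rmse(test$TARGET, predict(reduced, newdata = test))
print(c(full = rmse_full, reduced = rmse_reduced))
```

Both models are fitted on the training rows only, so the RMSE values reflect performance on data the models have not seen.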

The RMSE for the first (full) model is lower. The charts below show that model 2 consistently predicts values well below the true results.

The initial full model is therefore selected, for now, to produce predictions on the evaluation data. Further tuning could improve the accuracy of the predictions.

Predictions on the evaluation dataset

Predictions on the evaluation dataset are made using the model m1_cont.

The output of the model on the evaluated data is available under the following URL:

Appendix

The full R code for the analysis in Rmd format is available under the following URL:

Reference